Complete Developer Guide · 2025

Run AI Locally.
Own Your Intelligence.

A complete, beginner-friendly guide to Ollama: what it is, how to install it, and how to use it with harness tools like Codex CLI. No cloud, no API keys, no surveillance.

Open Source Local LLMs CPU & GPU No Cloud Needed
bash — ~/Desktop/ai-local
user@machine $ ollama run gemma3
> pulling model... ████████████ 100%
> model loaded. 0 cloud calls made.
user@machine $ ollama list
NAME           SIZE     MODIFIED
gemma3:latest   5.0 GB   2 hours ago
user@machine

Your laptop as an AI server

Ollama is an open-source tool that lets you run large language models (LLMs) directly on your own computer — no internet required, no API keys, no subscription. Think of it as a download manager + inference engine for AI models, wrapped in a simple command-line interface.

When you run a model with Ollama, your CPU (and GPU if available) do all the heavy lifting. Your prompts never leave your machine. This is called on-device inference.

Ollama also exposes a local REST API on port 11434, meaning any tool that knows how to talk to an HTTP endpoint can use it — editors, scripts, harness tools like Codex CLI, and more.

📡 Key Insight

Ollama is not an AI model itself. It's the runtime — the engine that loads, manages, and serves open-weight models like Gemma, Llama, Mistral, Phi, and others.

💻
Your App
Terminal, editor, script, browser
⚙️
Ollama
REST API :11434 model manager
🧠
LLM Model
Gemma, Llama, Mistral, Phi...
🖥️
Your CPU/GPU
100% local inference

How we got here

The ability to run LLMs locally didn't happen overnight. It's the result of years of research breakthroughs, open-source activism, and clever engineering.

2017
Transformers Paper — "Attention Is All You Need"
Google researchers published the transformer architecture that would become the backbone of every modern LLM. GPT, BERT, Llama — they all descend from this paper.
2020–2022
GPT-3 & The Closed AI Era
OpenAI released GPT-3 — powerful, but locked behind an API. Running it locally was impossible. The community began pushing for open alternatives.
Feb 2023
Meta releases LLaMA — the turning point
Meta released LLaMA to researchers under a restricted license, and within weeks the weights leaked publicly. For the first time, anyone could run a powerful language model on consumer hardware. llama.cpp, a pure C++ inference engine that ran on laptops, followed within days of the leak.
Mid 2023
Quantization unlocks consumer hardware
GGML, and later GGUF, quantization (4-bit, 5-bit) compressed models dramatically. A 7B-parameter model that needed ~14 GB of RAM now ran in under 5 GB. Regular laptops could run capable AI.
July 2023
🦙 Ollama launches (v0.1)
Built on top of llama.cpp, Ollama wrapped local model execution in a beautiful, Docker-inspired interface with simple commands (ollama run, ollama pull) and a REST API. It dramatically lowered the barrier to entry.
2024
Ecosystem explodes — Gemma, Mistral, Phi, Qwen
Google (Gemma), Microsoft (Phi), Mistral AI, and Alibaba (Qwen) all released competitive open-weight models. Ollama's model library grew to 100+ models. GPU acceleration support expanded for NVIDIA, AMD, and Apple Silicon.
2025
Harness tools proliferate — Codex CLI, Claude Code, Continue
AI coding assistants began supporting Ollama as a backend. You could now run a coding agent entirely offline, using your own hardware and models with zero cloud dependency.

Breaking it down

Complex technology should be explainable at every level. Here's Ollama explained three ways:

🧒 Explain Like I'm 8 (ELI8)
🧸

It's like downloading a smart toy to your room

You know how Siri and Alexa live on the internet and need Wi-Fi to answer you? Ollama is like downloading a really smart robot brain onto your computer, so it lives in your bedroom.

Once it's there, you can talk to it and ask it questions — even if your Wi-Fi is off! It doesn't tell anyone what you said, because it never goes to the internet. It's your private robot helper.

Ollama is the tool that helps put those robot brains on your computer. The brains are called "models" — they're like different toys you can download, each one good at different things.

👦 Explain Like I'm 10 (ELI10)
🔬

Running AI like a local game server

You know how in Minecraft you can play on a server with friends, but you can also start your own local server on your computer? Ollama is like setting up your own private AI server.

Big AI tools like ChatGPT run on huge computers in data centers — you're basically borrowing their power. With Ollama, you download the AI "brain" (called a model) to your own PC, and your computer does all the thinking.

The cool parts: it works offline, no one can see your chats, and you can try different models like Gemma or Llama — kind of like switching between different game characters, each with different abilities.

🎓 Developer Definition

Ollama is a locally run model inference server built on llama.cpp that provides a Docker-like CLI for pulling, running, and managing open-weight language models. It exposes an OpenAI-compatible REST API at localhost:11434, making it a drop-in replacement for cloud APIs in development workflows.

Step-by-step installation

General recommended specs for running models locally

🖥️
CPU
8+ cores
x86-64 or ARM (Apple Silicon)
🧠
RAM
16 GB+
More = bigger models
💾
Disk
50 GB+
Models range 2GB – 30GB+
🎮
GPU (Optional)
4 GB+ VRAM
NVIDIA/AMD/Intel Iris/Apple GPU
Hardware Tier | RAM | Models You Can Run | Speed | Status
Budget Laptop | 8 GB | gemma2:2b, phi3:mini, tinyllama | Slow (2–5 tok/s) | Works with patience
Mid-range Laptop | 16 GB | gemma3:4b, llama3.2:3b, mistral:7b | OK (5–15 tok/s) | Good daily driver
Gaming PC / M-series Mac | 32 GB | llama3.1:8b, qwen2.5:14b, gemma3:12b | Fast (15–50 tok/s) | Excellent
Workstation / Mac Studio | 64 GB+ | llama3.1:70b, qwen2.5:72b, deepseek | Very fast | Production grade
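
Where do those RAM numbers come from? A rough rule of thumb (an approximation, not an official formula) is that a 4-bit quantized model needs about half a byte of memory per parameter, plus a gigabyte or two of overhead for the context cache:

# Back-of-the-envelope RAM estimate for a 4-bit (Q4) quantized model
#   weights ≈ parameters × 0.5 bytes
#   7B model  → ~3.5 GB of weights + 1–2 GB overhead ≈ 5 GB of RAM
#   70B model → ~35 GB of weights + overhead → needs a 48–64 GB machine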

Install Ollama

Download and install Ollama for your platform.

# macOS / Linux — one-liner
curl -fsSL https://ollama.com/install.sh | sh

# Windows — Download installer from:
https://ollama.com/download/windows

# Verify installation
ollama --version

After install, Ollama runs as a background service and listens on localhost:11434.
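
You can verify the service is up before downloading anything by hitting the local port:

# The root endpoint replies with "Ollama is running"
curl http://localhost:11434

# List locally installed models via the API (empty until you pull one)
curl http://localhost:11434/api/tags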

Pull your first model

Choose a model from the Ollama library. For beginners, Gemma3 or Llama3.2 are excellent starting points.

# Pull a model (downloads to ~/.ollama/models)
ollama pull gemma3

# Or pull a specific size variant
ollama pull gemma3:4b
ollama pull llama3.2:3b
ollama pull mistral:7b

# List all downloaded models
ollama list

Run the model

# Interactive chat mode
ollama run gemma3

# Single prompt mode
ollama run gemma3 "Explain recursion in simple terms"

# Check what's running
ollama ps

# Stop a loaded model
ollama stop gemma3
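
ollama run also accepts piped standard input alongside a prompt, which makes it handy in shell pipelines. A quick sketch (the file names here are just examples):

# Ask the model to review a file's contents
cat utils.py | ollama run gemma3 "Review this code and point out bugs"

# Or summarize command output
git diff | ollama run gemma3 "Write a concise commit message for this diff"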

Use the API directly

Ollama exposes an OpenAI-compatible REST API. Any tool that supports OpenAI can point to Ollama instead.

# Basic API call with curl
curl http://localhost:11434/api/generate \
  -d '{
    "model": "gemma3",
    "prompt": "What is Ollama?",
    "stream": false
  }'

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
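
Because the /v1 endpoint mimics OpenAI's API, many OpenAI SDKs and tools can be redirected to Ollama simply by overriding their base URL and API key. The official OpenAI Python SDK, for example, reads these environment variables (support varies by tool, and Ollama ignores the key, but most clients require it to be non-empty):

# Point OpenAI-compatible tooling at the local Ollama server
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"   # placeholder value; never checked by Ollama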

Advanced: Create a custom Modelfile

A Modelfile is like a Dockerfile for AI models — define a system prompt, temperature, and parameters.

# Create a file called "Modelfile"
FROM gemma3

# Set a system prompt
SYSTEM """
You are a senior software engineer who gives
concise, accurate code reviews. Use markdown.
"""

# Tune parameters
PARAMETER temperature 0.3
PARAMETER num_ctx 8192

# Build and run your custom model
ollama create myreviewer -f Modelfile
ollama run myreviewer
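
To confirm the custom model actually picked up your system prompt and parameters, inspect it afterwards:

# Show the Modelfile and parameters baked into the custom model
ollama show myreviewer --modelfile
ollama show myreviewer --parameters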

Environment variables & configuration

# Change model storage location (default: ~/.ollama)
export OLLAMA_MODELS=/path/to/models

# Allow external access (LAN / other machines)
export OLLAMA_HOST=0.0.0.0:11434

# Note: the number of GPU-offloaded layers is a per-model option (num_gpu),
# not an environment variable. Set it with "PARAMETER num_gpu <layers>" in a
# Modelfile or via the "options" field of an API request, tuned for your VRAM.

# Set number of parallel requests
export OLLAMA_NUM_PARALLEL=2

# How long a model stays loaded after its last request (e.g. 5m, 1h, or -1 for indefinitely)
export OLLAMA_KEEP_ALIVE="10m"
⚠️ Performance tip

Set OLLAMA_KEEP_ALIVE="-1" to keep the model permanently loaded. This eliminates the cold-start delay between prompts at the cost of persistent RAM usage.
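
The same behavior can be controlled per request through the API, which is useful if you only want one specific model pinned in memory. Sending a generate request with no prompt preloads the model:

# Preload gemma3 and keep it resident indefinitely (use 0 to unload immediately)
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "keep_alive": -1
}'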

What models can you run?

Ollama's library hosts 100+ models. Here are the most popular for different use cases:

gemma3
by Google · 1B / 4B / 12B / 27B
Excellent all-rounder. Great for coding, reasoning, and chat. Very efficient at smaller sizes.
llama3.2
by Meta · 1B / 3B
Fast and lightweight. Best for constrained hardware. Ideal for quick tasks and embedding.
mistral
by Mistral AI · 7B
Strong at instruction-following and reasoning. Great balance of size and capability.
qwen2.5-coder
by Alibaba · 1.5B / 7B / 32B
Specialized for code. Excellent at completion, refactoring, and debugging tasks.
phi4
by Microsoft · 14B
Punches above its weight. Excellent reasoning per GB of model size.
deepseek-r1
by DeepSeek · 8B / 32B / 70B
Chain-of-thought reasoning model. Excellent for math, logic, and analytical tasks.
💡 Pro tip

Browse ollama.com/library to find and compare models. Model names with no tag default to :latest. Use specific tags like gemma3:4b-it-q4_K_M to control size and quantization level.

The honest trade-offs

Metric | Local (Ollama) | Cloud
Privacy | 100% private, zero data leaving the device | Prompts are processed on provider servers
Speed (7B model) | ~10 tok/s on CPU | ~80 tok/s
Cost (ongoing) | $0/month after setup | $20–$200+/mo
Model quality | Good (smaller models) | Best (GPT-4, Claude)
Offline use | Yes, works anywhere | Needs internet

Advantages

  • Complete privacy — prompts never leave your machine
  • No recurring costs after hardware investment
  • Works offline — planes, remote locations, no Wi-Fi
  • No rate limits or context window throttling
  • Fully customizable via Modelfiles and system prompts
  • OpenAI-compatible API — drop-in for existing tools
  • Run multiple models simultaneously
  • No censorship or content filtering from providers
  • Educational — understand how LLMs actually work

Shortcomings

  • Slower than cloud — CPU inference is notably slower than datacenter GPUs
  • Hardware ceiling — larger, smarter models need more RAM/VRAM
  • High RAM usage — a 7B model alone uses 5–8GB RAM
  • Model quality gap — local 7B models lag behind GPT-4 or Claude Opus
  • CPU heat and battery drain on laptops
  • Initial model downloads are large (2–30 GB per model)
  • No internet-connected tools (web search) without extra setup
  • Limited multimodal capability at smaller sizes

Ollama as an AI backend

Ollama's real superpower is acting as the engine for other tools. "Harness tools" are CLIs and editors that sit on top of a model API and give it agentic capabilities — like browsing files, writing code, and running commands.

OpenAI / Anthropic

Codex CLI

OpenAI's official CLI coding agent. Designed around OpenAI's hosted models, but it can be pointed at other OpenAI-compatible endpoints, including Ollama. It scans your codebase, understands context, and can write, edit, and run code.

Supports Ollama CPU Intensive
Anthropic

Claude Code

Anthropic's agentic coding tool. It primarily uses the Claude API; pointing it at local models generally requires a compatibility proxy that translates between API formats. Excellent for large codebases thanks to its extended thinking capability.

Experimental Local Best with Cloud
Open Source

Continue.dev

VS Code / JetBrains extension for AI coding. First-class Ollama support. Provides autocomplete, chat, and edit modes, all powered by your local models.

Native Ollama
Community

Open WebUI

A ChatGPT-like browser interface that connects directly to Ollama. Run it locally and get a full-featured chat UI with history, RAG, and model switching — all offline.

Native Ollama
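
Open WebUI is easiest to try via Docker. The exact command may change, so check its README, but a typical invocation looks like this (assumes Docker is installed and Ollama is already running on the host):

# Serve Open WebUI at http://localhost:3000, storing data in a named volume;
# --add-host lets the container reach the host's Ollama on port 11434
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main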

Using Codex CLI with Ollama + Gemma

Running an open-weight coding agent entirely on local hardware — zero cloud calls, full source-code privacy. Here's exactly what happens, step by step.

Step 1 Make sure Ollama is running and pull your model
$ ollama pull gemma3:4b
Step 2 Install Codex CLI and point it at Ollama's OpenAI-compatible endpoint
npm install -g @openai/codex
# Configure Ollama (http://localhost:11434/v1) as Codex's model provider and
# pick a model such as gemma3:4b (see the configuration sketch after these steps)
Step 3 Navigate to your project directory and start Codex
cd ~/Desktop/my-project
codex
Step 4 Codex scans your project and builds a context map
> Working... analyzing file tree, reading imports, mapping dependencies
Step 5 You're live! Ask it anything about your codebase
> "Find and fix the null pointer bug in auth.js"
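
The provider configuration mentioned in Step 2 lives in Codex CLI's config file. Key names may differ between releases, so treat this as a sketch and verify against the docs for your installed version:

# Sketch: register Ollama as a model provider for Codex CLI
# (writes ~/.codex/config.toml; back up any existing config first)
cat > ~/.codex/config.toml <<'EOF'
model = "gemma3:4b"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
EOF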
🔍 Real-World Observation

Running Codex CLI with gemma3:4b on a 16 GB RAM laptop uses roughly 5–8 GB of RAM for the model and pushes the CPU to 70–100% during inference. Responses arrive in 10–30 seconds per turn. It works, but it requires patience and benefits enormously from GPU acceleration or higher-end hardware. For production use, 32 GB of RAM or an NVIDIA GPU with 8 GB+ of VRAM is recommended.

Quick Reference — Codex CLI Commands

Command | Description
codex --model <name> | Start Codex CLI with a specific model (with Ollama configured as the provider)
scan the project | Ask Codex to analyze your codebase and build understanding
find and fix a bug in @filename | Ask Codex to diagnose and patch bugs in a specific file
write tests for @filename | Generate unit tests for a given module
/model | Switch to a different model mid-session
Esc | Interrupt a running inference

Quick Reference — Ollama Commands

Command | Description
ollama run <model> | Start an interactive chat session with a model
ollama pull <model> | Download a model from the Ollama library
ollama list | Show all downloaded models
ollama ps | Show currently loaded models and resource usage
ollama rm <model> | Delete a model from local storage
ollama create <name> -f Modelfile | Create a custom model from a Modelfile
ollama show <model> | Display model metadata, parameters, and Modelfile
ollama serve | Start the Ollama server manually (usually auto-started)

Go deeper

The local AI ecosystem moves fast. Here are the best places to keep learning:

🦙
Official Docs
Ollama Documentation

Complete reference for commands, API, Modelfile format, and GPU setup guides.

ollama.com/docs
🧪
Model Library
Ollama Model Hub

Browse and search 100+ models with sizes, benchmarks, and pull commands.

ollama.com/library
🐙
Open Source
Ollama on GitHub

Source code, issues, community integrations, and contribution guides.

github.com/ollama/ollama
🖥️
GUI Tool
Open WebUI

A beautiful ChatGPT-style interface that runs locally on top of Ollama.

github.com/open-webui/open-webui
💻
VS Code Extension
Continue.dev

Integrate Ollama with VS Code or JetBrains for AI autocomplete and chat.

continue.dev
📰
Community
r/LocalLLaMA

The most active community for local LLM enthusiasts. Tips, benchmarks, model comparisons.

reddit.com/r/LocalLLaMA
📚
Research
Hugging Face

The home of open model weights, datasets, and leaderboards for model comparison.

huggingface.co
Inference Engine
llama.cpp

The C++ engine that powers Ollama under the hood. For advanced users who want direct control.

github.com/ggerganov/llama.cpp
🤖
Harness Tool
Codex CLI

OpenAI's terminal coding agent. Works with Ollama as an open-weight backend.

github.com/openai/codex
Term | Meaning
LLM | Large Language Model: a neural network trained on text to predict and generate language (GPT, Gemma, Llama, etc.)
Inference | Running a trained model to generate outputs. "Local inference" means your CPU/GPU does this, not a remote server.
Quantization | Compressing model weights from 16- or 32-bit floats down to 4-bit integers, reducing RAM requirements roughly 4–8x with minimal quality loss.
GGUF | The file format Ollama uses to store quantized models. Designed for efficient CPU inference.
Context Window | How many tokens (roughly, word pieces) the model can "see" at once. Larger = more memory needed. Configured via num_ctx.
Modelfile | A configuration file for customizing a model's behavior, like a Dockerfile for AI. Defines the system prompt, parameters, etc.
Open-weight | Models whose weights (parameters) are publicly released so you can download and run them yourself; often loosely called "open source AI", though licenses vary.
Harness tool | A CLI or app that wraps a model API and gives it agentic capabilities (file access, code execution, tool calling).
tok/s | Tokens per second, a measure of inference speed. 10 tok/s is roughly 7 words per second. Cloud APIs do 50–100+ tok/s.